Build the right model to predict the “Energy Star score” of a building using the provided data.
The NYC Benchmarking Law requires owners of large buildings to measure their energy and water consumption annually in a process called benchmarking. The law standardizes this process by requiring building owners to enter their annual energy and water use into the U.S. Environmental Protection Agency’s (EPA) online tool, Energy Star Portfolio Manager, and to use the tool to submit their data. This data shows building owners how a building’s energy and water consumption compares to similar buildings, and tracks progress year over year to help with energy-efficiency planning.
An energy-efficiency score is the Energy Star Rating that a building earns by using the EPA’s online benchmarking tool, Energy Star Portfolio Manager, to compare its energy performance against similar buildings in similar climates.
After going through the entire assignment, I chose to work on this problem because I am confident I can complete it within the given timeframe of 2 days, and because it lets me demonstrate my capabilities well, which should also help the team decide on my candidature.
My first objective is to get a good understanding of the dataset, followed by exploratory data analysis, data sanity checks, feature engineering, feature selection, model selection, stacking, scoring and any future work if required.
I will highlight all important notes and points wherever required, just like this line is highlighted.
Since the target variable is bounded between 1 and 100 and has fewer than 100 unique categories/levels, I think tree-based models such as XGBoost, GBM and random forests can work well here, because they partition the feature space and then aggregate information at the level of each partition.
Let’s dive in and see what’s in the data.
Loading R packages used besides base R.
library(knitr)
library(ggplot2)
library(plyr)
library(dplyr)
library(caret)
library(gridExtra)
library(scales)
library(randomForest)
library(psych)
library(xgboost)
library(readr)
library(data.table)
library(DT)
library(DataExplorer)
setwd("C:/Users/sysadmin/Downloads/Assignment/")
raw_data<-read_csv("Usecase1_Dataset.csv")
DT::datatable(head(raw_data,100))
Here I describe my understanding of the data and what each column means. I might be wrong, but for the time being I will continue with these assumptions. If my candidature goes forward, we can discuss them, along with any other assumptions I make throughout this notebook.
The information presented below is just a first impression of the data; we will keep changing variable types based on the kind of information each one carries.
–> Also, at the very first glance at the dataset I could see missing values represented as “Not Available”; we will change these to missing (NA) or ‘0’ depending on the information in each variable.
datatable(introduce(raw_data))
plot_intro(raw_data)
Let’s recode ‘Not Available’ as missing in our dataset.
raw_data[raw_data=='Not Available']=NA
Now let’s visualize the missing information in the data:
# Custom function
plot_missing_2 <-
function (data, group = list(Good = 0.05, Okay = 0.4, Poor = 0.8,
Scarce = 1), geom_label_args = list(), title = NULL, ggtheme = theme_gray(),
theme_config = list(legend.position = c("bottom")))
{
pct_missing <- Band <- NULL
missing_value <- data.table(profile_missing(data))
group <- group[sort.list(unlist(group))]
invisible(lapply(seq_along(group), function(i) {
if (i == 1) {
missing_value[pct_missing <= group[[i]], `:=`(Band,
names(group)[i])]
} else {
missing_value[pct_missing > group[[i - 1]] & pct_missing <=
group[[i]], `:=`(Band, names(group)[i])]
}
}))
output <- ggplot(missing_value, aes_string(x = "feature",
y = "num_missing", fill = "Band")) + geom_bar(stat = "identity") +
scale_fill_manual("Band", values = c("Good"="green2","Okay"="gold","Poor"="darkorange","Scarce"="firebrick2")) + coord_flip() + xlab("Features") +
ylab("Missing Rows")
geom_label_args_list <- list(mapping = aes(label = paste0(round(100 *
pct_missing, 2), "%")))
output <- output + do.call("geom_label", c(geom_label_args_list,
geom_label_args))
class(output) <- c("single", class(output))
plotDataExplorer(plot_obj = output, title = title, ggtheme = ggtheme,
theme_config = theme_config)
}
plot_missing_2(raw_data)
There seem to be a lot of missing values; we will handle them carefully in the next sections.
In this section we will change the data type of variables that were read as character/factor because of unwanted string values. We also have to be cautious when transforming some of the numeric columns to factors, since that can result in a lot of factor levels; binning may be needed.
str(raw_data)
## tibble [11,746 x 60] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Order : num [1:11746] 1 2 3 4 5 6 7 10 11 12 ...
## $ Property Id : num [1:11746] 13286 28400 4778226 4778267 4778288 ...
## $ Property Name : chr [1:11746] "201/205" "NYP Columbia (West Campus)" "MSCHoNY North" "Herbert Irving Pavilion & Millstein Hospital" ...
## $ Parent Property Id : chr [1:11746] "13286" "28400" "28400" "28400" ...
## $ Parent Property Name : chr [1:11746] "201/205" "NYP Columbia (West Campus)" "NYP Columbia (West Campus)" "NYP Columbia (West Campus)" ...
## $ BBL - 10 digits : chr [1:11746] "1013160001" "1021380040" "1021380030" "1021390001" ...
## $ NYC Borough, Block and Lot (BBL) self-reported : chr [1:11746] "1013160001" "1-02138-0040" "1-02138-0030" "1-02139-0001" ...
## $ NYC Building Identification Number (BIN) : chr [1:11746] "1037549" "1084198; 1084387;1084385; 1084386; 1084388; 1084389; 1807867; 1809824" "1063380" "1087281; 1076746" ...
## $ Address 1 (self-reported) : chr [1:11746] "201/205 East 42nd st." "622 168th Street" "3975 Broadway" "161 Fort Washington Ave" ...
## $ Address 2 : chr [1:11746] NA NA NA "177 Fort Washington Ave" ...
## $ Postal Code : chr [1:11746] "10017" "10032" "10032" "10032" ...
## $ Street Number : chr [1:11746] "675" "180" "3975" "161" ...
## $ Street Name : chr [1:11746] "3 AVENUE" "FT WASHINGTON AVENUE" "BROADWAY" "FT WASHINGTON AVENUE" ...
## $ Borough : chr [1:11746] "Manhattan" "Manhattan" "Manhattan" "Manhattan" ...
## $ DOF Gross Floor Area : num [1:11746] 289356 3693539 152765 891040 211400 ...
## $ Primary Property Type - Self Selected : chr [1:11746] "Office" "Hospital (General Medical & Surgical)" "Hospital (General Medical & Surgical)" "Hospital (General Medical & Surgical)" ...
## $ List of All Property Use Types at Property : chr [1:11746] "Office" "Hospital (General Medical & Surgical)" "Hospital (General Medical & Surgical)" "Hospital (General Medical & Surgical)" ...
## $ Largest Property Use Type : chr [1:11746] "Office" "Hospital (General Medical & Surgical)" "Hospital (General Medical & Surgical)" "Hospital (General Medical & Surgical)" ...
## $ Largest Property Use Type - Gross Floor Area (ft²) : chr [1:11746] "293447" "3889181" "231342" "1305748" ...
## $ 2nd Largest Property Use Type : chr [1:11746] NA NA NA NA ...
## $ 2nd Largest Property Use - Gross Floor Area (ft²) : chr [1:11746] NA NA NA NA ...
## $ 3rd Largest Property Use Type : chr [1:11746] NA NA NA NA ...
## $ 3rd Largest Property Use Type - Gross Floor Area (ft²) : chr [1:11746] NA NA NA NA ...
## $ Year Built : num [1:11746] 1963 1969 1924 1971 1932 ...
## $ Number of Buildings - Self-reported : num [1:11746] 2 12 1 1 1 12 1 4 1 1 ...
## $ Occupancy : num [1:11746] 100 100 100 100 100 100 60 100 100 100 ...
## $ Metered Areas (Energy) : chr [1:11746] "Whole Building" "Whole Building" NA NA ...
## $ Metered Areas (Water) : chr [1:11746] NA "Whole Building" NA NA ...
## $ ENERGY STAR Score : chr [1:11746] NA "55" NA NA ...
## $ Site EUI (kBtu/ft²) : chr [1:11746] "305.6" "229.8" NA NA ...
## $ Weather Normalized Site EUI (kBtu/ft²) : chr [1:11746] "303.1" "228.8" NA NA ...
## $ Weather Normalized Site Electricity Intensity (kWh/ft²) : chr [1:11746] "37.8" "24.8" NA NA ...
## $ Weather Normalized Site Natural Gas Intensity (therms/ft²): chr [1:11746] NA "2.4" NA NA ...
## $ Weather Normalized Source EUI (kBtu/ft²) : chr [1:11746] "614.2" "401.1" NA NA ...
## $ Fuel Oil #1 Use (kBtu) : chr [1:11746] NA NA NA NA ...
## $ Fuel Oil #2 Use (kBtu) : chr [1:11746] NA "1.96248472E7" NA NA ...
## $ Fuel Oil #4 Use (kBtu) : chr [1:11746] NA NA NA NA ...
## $ Fuel Oil #5 & 6 Use (kBtu) : chr [1:11746] NA NA NA NA ...
## $ Diesel #2 Use (kBtu) : chr [1:11746] NA NA NA NA ...
## $ District Steam Use (kBtu) : chr [1:11746] "5.15506751E7" "-3.914148026E8" NA NA ...
## $ Natural Gas Use (kBtu) : chr [1:11746] NA "933073441" NA NA ...
## $ Weather Normalized Site Natural Gas Use (therms) : chr [1:11746] NA "9330734.4" NA NA ...
## $ Electricity Use - Grid Purchase (kBtu) : chr [1:11746] "38139374.2" "332365924" NA NA ...
## $ Weather Normalized Site Electricity (kWh) : chr [1:11746] "1.10827705E7" "9.62613121E7" NA NA ...
## $ Total GHG Emissions (Metric Tons CO2e) : chr [1:11746] "6962.2" "55870.4" "0" "0" ...
## $ Direct GHG Emissions (Metric Tons CO2e) : chr [1:11746] "0" "51016.4" "0" "0" ...
## $ Indirect GHG Emissions (Metric Tons CO2e) : chr [1:11746] "6962.2" "4854.1" "0" "0" ...
## $ Property GFA - Self-Reported (ft²) : num [1:11746] 762051 3889181 231342 1305748 179694 ...
## $ Water Use (All Water Sources) (kgal) : chr [1:11746] NA NA NA NA ...
## $ Water Intensity (All Water Sources) (gal/ft²) : chr [1:11746] NA NA NA NA ...
## $ Source EUI (kBtu/ft²) : chr [1:11746] "619.4" "404.3" NA NA ...
## $ Release Date : chr [1:11746] "05/01/2017 05:32:03 PM" "04/27/2017 11:23:27 AM" "04/27/2017 11:23:27 AM" "04/27/2017 11:23:27 AM" ...
## $ Water Required? : chr [1:11746] "No" "No" "No" "No" ...
## $ DOF Benchmarking Submission Status : chr [1:11746] "In Compliance" "In Compliance" "In Compliance" "In Compliance" ...
## $ Latitude : num [1:11746] 40.8 40.8 40.8 40.8 40.8 ...
## $ Longitude : num [1:11746] -74 -73.9 -73.9 -73.9 -73.9 ...
## $ Community Board : num [1:11746] 6 12 12 12 12 8 8 13 13 13 ...
## $ Council District : num [1:11746] 4 10 10 10 10 5 5 23 23 23 ...
## $ Census Tract : num [1:11746] 88 251 251 255 255 ...
## $ NTA : chr [1:11746] "Turtle Bay-East Midtown" "Washington Heights South" "Washington Heights South" "Washington Heights South" ...
## - attr(*, "spec")=
## .. cols(
## .. Order = col_double(),
## .. `Property Id` = col_double(),
## .. `Property Name` = col_character(),
## .. `Parent Property Id` = col_character(),
## .. `Parent Property Name` = col_character(),
## .. `BBL - 10 digits` = col_character(),
## .. `NYC Borough, Block and Lot (BBL) self-reported` = col_character(),
## .. `NYC Building Identification Number (BIN)` = col_character(),
## .. `Address 1 (self-reported)` = col_character(),
## .. `Address 2` = col_character(),
## .. `Postal Code` = col_character(),
## .. `Street Number` = col_character(),
## .. `Street Name` = col_character(),
## .. Borough = col_character(),
## .. `DOF Gross Floor Area` = col_double(),
## .. `Primary Property Type - Self Selected` = col_character(),
## .. `List of All Property Use Types at Property` = col_character(),
## .. `Largest Property Use Type` = col_character(),
## .. `Largest Property Use Type - Gross Floor Area (ft²)` = col_character(),
## .. `2nd Largest Property Use Type` = col_character(),
## .. `2nd Largest Property Use - Gross Floor Area (ft²)` = col_character(),
## .. `3rd Largest Property Use Type` = col_character(),
## .. `3rd Largest Property Use Type - Gross Floor Area (ft²)` = col_character(),
## .. `Year Built` = col_double(),
## .. `Number of Buildings - Self-reported` = col_double(),
## .. Occupancy = col_double(),
## .. `Metered Areas (Energy)` = col_character(),
## .. `Metered Areas (Water)` = col_character(),
## .. `ENERGY STAR Score` = col_character(),
## .. `Site EUI (kBtu/ft²)` = col_character(),
## .. `Weather Normalized Site EUI (kBtu/ft²)` = col_character(),
## .. `Weather Normalized Site Electricity Intensity (kWh/ft²)` = col_character(),
## .. `Weather Normalized Site Natural Gas Intensity (therms/ft²)` = col_character(),
## .. `Weather Normalized Source EUI (kBtu/ft²)` = col_character(),
## .. `Fuel Oil #1 Use (kBtu)` = col_character(),
## .. `Fuel Oil #2 Use (kBtu)` = col_character(),
## .. `Fuel Oil #4 Use (kBtu)` = col_character(),
## .. `Fuel Oil #5 & 6 Use (kBtu)` = col_character(),
## .. `Diesel #2 Use (kBtu)` = col_character(),
## .. `District Steam Use (kBtu)` = col_character(),
## .. `Natural Gas Use (kBtu)` = col_character(),
## .. `Weather Normalized Site Natural Gas Use (therms)` = col_character(),
## .. `Electricity Use - Grid Purchase (kBtu)` = col_character(),
## .. `Weather Normalized Site Electricity (kWh)` = col_character(),
## .. `Total GHG Emissions (Metric Tons CO2e)` = col_character(),
## .. `Direct GHG Emissions (Metric Tons CO2e)` = col_character(),
## .. `Indirect GHG Emissions (Metric Tons CO2e)` = col_character(),
## .. `Property GFA - Self-Reported (ft²)` = col_double(),
## .. `Water Use (All Water Sources) (kgal)` = col_character(),
## .. `Water Intensity (All Water Sources) (gal/ft²)` = col_character(),
## .. `Source EUI (kBtu/ft²)` = col_character(),
## .. `Release Date` = col_character(),
## .. `Water Required?` = col_character(),
## .. `DOF Benchmarking Submission Status` = col_character(),
## .. Latitude = col_double(),
## .. Longitude = col_double(),
## .. `Community Board` = col_double(),
## .. `Council District` = col_double(),
## .. `Census Tract` = col_double(),
## .. NTA = col_character()
## .. )
raw_data<-as.data.frame(raw_data)
cols <- colnames(raw_data[,c(19,21,23,29:51)])
raw_data[cols] <- sapply(raw_data[cols],as.numeric)
raw_data$`ENERGY STAR Score`<-as.numeric(raw_data$`ENERGY STAR Score`)
datatable(raw_data[,c('NYC Borough, Block and Lot (BBL) self-reported','BBL - 10 digits','Borough')])
It seems these columns contain almost the same information. The first digit of ‘NYC Borough, Block and Lot (BBL) self-reported’ probably encodes the ‘Borough’, the next five digits the ‘Tax Block’ and the last four the ‘Tax Lot’. Since ‘Borough’ has fewer missing values, I’ll keep it and derive the block and lot from ‘BBL - 10 digits’.
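Under that assumed layout, a 10-digit BBL can be decomposed with simple string operations. A minimal base-R sketch (the example value is illustrative):

```r
# Decompose a 10-digit BBL into borough (1 digit), tax block (5 digits)
# and tax lot (4 digits), per the assumed layout above.
decompose_bbl <- function(bbl) {
  list(
    borough = substr(bbl, 1, 1),
    block   = substr(bbl, 2, 6),
    lot     = substr(bbl, 7, 10)
  )
}

decompose_bbl("1013160001")
# borough "1" (Manhattan), block "01316", lot "0001"
```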
‘Address 1 (self-reported)’, ‘Address 2’, ‘Postal Code’, ‘Street Name’, ‘Street Number’, ‘Latitude’ and ‘Longitude’ contain information about the location. I’ll keep only ‘Postal Code’, as it seems to be the most generic of these besides ‘Borough’. There may be additional or more general location information in ‘Census Tract’, ‘Community Board’ and ‘Council District’, but for now I’ll ignore these columns.
‘Total GHG Emissions (Metric Tons CO2e)’ = ‘Direct GHG Emissions (Metric Tons CO2e)’ + ‘Indirect GHG Emissions (Metric Tons CO2e)’, so I’ll keep the total and the fraction that is direct.
‘Property GFA - Self-Reported (ft²)’ matches ‘DOF Gross Floor Area’, so I’ll drop ‘DOF Gross Floor Area’, which has missing values.
‘Largest Property Use Type - Gross Floor Area (ft²)’, ‘2nd Largest Property Use - Gross Floor Area (ft²)’ and ‘3rd Largest Property Use Type - Gross Floor Area (ft²)’ are parts of ‘Property GFA - Self-Reported (ft²)’, so I’ll create and keep the fractions they represent and drop the original values.
raw_data$largest_property_use_rate<-raw_data$`Largest Property Use Type - Gross Floor Area (ft²)`/raw_data$`Property GFA - Self-Reported (ft²)`
raw_data$second_property_use_rate <- raw_data$`2nd Largest Property Use - Gross Floor Area (ft²)`/raw_data$`Property GFA - Self-Reported (ft²)`
raw_data$third_property_use_rate <- raw_data$`3rd Largest Property Use Type - Gross Floor Area (ft²)`/raw_data$`Property GFA - Self-Reported (ft²)`
raw_data$Direct_GHG_Emissions_rate<-raw_data$`Direct GHG Emissions (Metric Tons CO2e)`/raw_data$`Total GHG Emissions (Metric Tons CO2e)`
raw_data[raw_data==""]=NA
raw_data[raw_data==" "]=NA
After these transformations, and after replacing blank entries, let’s see whether the missing-information picture has changed from what we observed earlier:
plot_missing_2(raw_data)
There is a difference, which means blank entries were hiding additional missing information in the data.
Let’s treat the missing values in the variables that are important to us:
raw_data$Block<-substr(raw_data$`BBL - 10 digits`,2,6)
raw_data$lot<-substr(raw_data$`BBL - 10 digits`,7,10)
library(tidyr)
zero_col = c('second_property_use_rate','third_property_use_rate','Water Intensity (All Water Sources) (gal/ft²)','Weather Normalized Site Natural Gas Intensity (therms/ft²)','Total GHG Emissions (Metric Tons CO2e)','Weather Normalized Site Electricity Intensity (kWh/ft²)','Direct_GHG_Emissions_rate')
plot_missing_2(raw_data[zero_col])
raw_data<-raw_data %>%
mutate_at(vars(zero_col), ~replace_na(., 0))
prop_col = c('2nd Largest Property Use Type','3rd Largest Property Use Type')
raw_data<-raw_data %>%
mutate_at(vars(prop_col), ~replace_na(., 'none'))
columns_to_drop = c('NYC Borough, Block and Lot (BBL) self-reported',
'NYC Building Identification Number (BIN)',
'BBL - 10 digits',
'Parent Property Name',
'Property Name',
'Address 1 (self-reported)',
'Address 2',
'Street Number',
'Street Name',
'Latitude',
'Longitude',
'DOF Gross Floor Area',
'DOF Benchmarking Submission Status',
'List of All Property Use Types at Property',
'Largest Property Use Type - Gross Floor Area (ft²)',
'2nd Largest Property Use - Gross Floor Area (ft²)',
'3rd Largest Property Use Type - Gross Floor Area (ft²)',
'Fuel Oil #1 Use (kBtu)',
'Fuel Oil #2 Use (kBtu)',
'Fuel Oil #4 Use (kBtu)',
'Fuel Oil #5 & 6 Use (kBtu)',
'Diesel #2 Use (kBtu)',
'District Steam Use (kBtu)',
'Natural Gas Use (kBtu)',
'Weather Normalized Site Natural Gas Use (therms)',
'Electricity Use - Grid Purchase (kBtu)',
'Weather Normalized Site Electricity (kWh)',
'Direct GHG Emissions (Metric Tons CO2e)',
'Indirect GHG Emissions (Metric Tons CO2e)',
'Property GFA - Self-Reported (ft²)',
'Water Use (All Water Sources) (kgal)',
'Weather Normalized Site EUI (kBtu/ft²)',
'Weather Normalized Source EUI (kBtu/ft²)')
raw_data1<-raw_data[!colnames(raw_data)%in%columns_to_drop]
All the property-type columns have a lot of levels; we need to bin the levels that are irrelevant, redundant or least frequent. There are many ways to bin the levels of a factor variable, but for now I will restrict myself to frequency-based renaming and binning.
condenseMe <- function(vector, threshold = 0.002, newName = "Other") {
toCondense <- names(which(prop.table(table(vector)) < threshold))
vector[vector %in% toCondense] <- newName
vector
}
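As a quick sanity check, here is how the function behaves on a toy vector (the definition is repeated so the snippet runs standalone, and the threshold is raised to 0.2 so the effect is visible at this size):

```r
# condenseMe as defined above: rename levels rarer than the threshold
condenseMe <- function(vector, threshold = 0.002, newName = "Other") {
  toCondense <- names(which(prop.table(table(vector)) < threshold))
  vector[vector %in% toCondense] <- newName
  vector
}

v <- c(rep("Office", 5), rep("Hotel", 4), "Parking")
condenseMe(v, threshold = 0.2)
# "Parking" (10% of rows) is renamed to "Other"; "Office" and "Hotel" survive
```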
raw_data1$`Largest Property Use Type`<-condenseMe(raw_data1$`Largest Property Use Type`)
raw_data1$`2nd Largest Property Use Type`<-condenseMe(raw_data1$`2nd Largest Property Use Type`)
raw_data1$`3rd Largest Property Use Type`<-condenseMe(raw_data1$`3rd Largest Property Use Type`)
Let’s check the unique values in these three columns. After frequency-based binning, the unique values in these factors have come down to 18, 16 and 10 respectively (NA still counts as one value in the first column; the other two use ‘none’ for missing).
sapply(raw_data1, uniqueN)
## Order
## 11746
## Property Id
## 11746
## Parent Property Id
## 102
## Postal Code
## 286
## Borough
## 6
## Primary Property Type - Self Selected
## 55
## Largest Property Use Type
## 18
## 2nd Largest Property Use Type
## 16
## 3rd Largest Property Use Type
## 10
## Year Built
## 157
## Number of Buildings - Self-reported
## 49
## Occupancy
## 19
## Metered Areas (Energy)
## 8
## Metered Areas (Water)
## 7
## ENERGY STAR Score
## 101
## Site EUI (kBtu/ft²)
## 1959
## Weather Normalized Site Electricity Intensity (kWh/ft²)
## 440
## Weather Normalized Site Natural Gas Intensity (therms/ft²)
## 65
## Total GHG Emissions (Metric Tons CO2e)
## 7817
## Water Intensity (All Water Sources) (gal/ft²)
## 5606
## Source EUI (kBtu/ft²)
## 2920
## Release Date
## 3537
## Water Required?
## 3
## Community Board
## 20
## Council District
## 45
## Census Tract
## 808
## NTA
## 145
## largest_property_use_rate
## 3132
## second_property_use_rate
## 3434
## third_property_use_rate
## 1400
## Direct_GHG_Emissions_rate
## 10649
## Block
## 4068
## lot
## 483
Let’s look at the distribution of the target across the frequent levels of the property-use-type variables.
dat<-raw_data1[,c(7:9,15)]
colnames(dat)<-make.names(colnames(dat))
library(plotly)
fig <- plot_ly(dat, y = ~ENERGY.STAR.Score, color = ~Largest.Property.Use.Type, type = "box")
fig
fig <- plot_ly(dat, y = ~ENERGY.STAR.Score, color = ~X2nd.Largest.Property.Use.Type, type = "box")
fig
fig <- plot_ly(dat, y = ~ENERGY.STAR.Score, color = ~X3rd.Largest.Property.Use.Type, type = "box")
fig
First, let’s see the proportion of missing values in all the columns:
prop_NA <- apply(raw_data1, 2, function(x) round(sum(is.na(x))/length(x), digits = 4))
# sort the results
prop_NA[order(prop_NA)]
## Order
## 0.0000
## Property Id
## 0.0000
## Parent Property Id
## 0.0000
## Postal Code
## 0.0000
## Primary Property Type - Self Selected
## 0.0000
## 2nd Largest Property Use Type
## 0.0000
## 3rd Largest Property Use Type
## 0.0000
## Year Built
## 0.0000
## Number of Buildings - Self-reported
## 0.0000
## Occupancy
## 0.0000
## Weather Normalized Site Electricity Intensity (kWh/ft²)
## 0.0000
## Weather Normalized Site Natural Gas Intensity (therms/ft²)
## 0.0000
## Total GHG Emissions (Metric Tons CO2e)
## 0.0000
## Water Intensity (All Water Sources) (gal/ft²)
## 0.0000
## Release Date
## 0.0000
## second_property_use_rate
## 0.0000
## third_property_use_rate
## 0.0000
## Direct_GHG_Emissions_rate
## 0.0000
## Largest Property Use Type
## 0.0002
## largest_property_use_rate
## 0.0002
## Block
## 0.0009
## lot
## 0.0009
## Metered Areas (Energy)
## 0.0049
## Borough
## 0.0100
## Water Required?
## 0.0100
## Site EUI (kBtu/ft²)
## 0.0139
## Source EUI (kBtu/ft²)
## 0.0139
## ENERGY STAR Score
## 0.1791
## Community Board
## 0.1927
## Council District
## 0.1927
## Census Tract
## 0.1927
## NTA
## 0.1927
## Metered Areas (Water)
## 0.3924
Let’s visualize these missing values to get an idea of where in the data they occur:
library(naniar)
gg_miss_upset(raw_data1)
vis_miss(raw_data1)
We can see clear intersections of rows: most variables with missing data share the same missing rows. So we can try removing the rows where the target variable is missing and see whether that reduces the missingness in the other columns.
We remove observations where the target variable is missing. We also remove some redundant variables: ‘Metered Areas (Energy)’ and ‘Metered Areas (Water)’ are nearly constant, and ‘Water Required?’ shows no difference in the target’s distribution between its Yes and No values.
raw_data1<-raw_data1[!is.na(raw_data1$`ENERGY STAR Score`),]
tapply(raw_data1$`ENERGY STAR Score`,raw_data1$`Water Required?`, summary)
## $No
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 37.00 65.00 59.82 85.00 100.00
##
## $Yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 36.0 65.0 59.8 85.0 100.0
remv<-c('Metered Areas (Energy)','Metered Areas (Water)','Release Date','Water Required?','Community Board','Council District','Census Tract','NTA')
raw_data1<-raw_data1[!colnames(raw_data1)%in% remv]
prop_NA <- apply(raw_data1, 2, function(x) round(sum(is.na(x))/length(x), digits = 4))
# sort the results
prop_NA[order(prop_NA)]
## Order
## 0.0000
## Property Id
## 0.0000
## Parent Property Id
## 0.0000
## Postal Code
## 0.0000
## Primary Property Type - Self Selected
## 0.0000
## Largest Property Use Type
## 0.0000
## 2nd Largest Property Use Type
## 0.0000
## 3rd Largest Property Use Type
## 0.0000
## Year Built
## 0.0000
## Number of Buildings - Self-reported
## 0.0000
## Occupancy
## 0.0000
## ENERGY STAR Score
## 0.0000
## Site EUI (kBtu/ft²)
## 0.0000
## Weather Normalized Site Electricity Intensity (kWh/ft²)
## 0.0000
## Weather Normalized Site Natural Gas Intensity (therms/ft²)
## 0.0000
## Total GHG Emissions (Metric Tons CO2e)
## 0.0000
## Water Intensity (All Water Sources) (gal/ft²)
## 0.0000
## Source EUI (kBtu/ft²)
## 0.0000
## largest_property_use_rate
## 0.0000
## second_property_use_rate
## 0.0000
## third_property_use_rate
## 0.0000
## Direct_GHG_Emissions_rate
## 0.0000
## Block
## 0.0002
## lot
## 0.0002
## Borough
## 0.0065
We could impute the missing values using mode- or model-based imputation, but since the amount of missingness is low, I will simply remove those rows, as I don’t have a lot of time left ;)
raw_data1<-raw_data1[complete.cases(raw_data1), ]
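For reference, the mode-based imputation mentioned above could look like this. A minimal base-R sketch on a toy vector (the values are illustrative):

```r
# Replace NAs in a categorical vector with its most frequent value (the mode)
impute_mode <- function(x) {
  mode_val <- names(which.max(table(x)))  # table() drops NAs by default
  x[is.na(x)] <- mode_val
  x
}

impute_mode(c("Manhattan", "Manhattan", "Brooklyn", NA))
# the NA becomes "Manhattan"
```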
Removing columns that won’t be used in the modelling:
selection = c('Postal Code','Parent Property Id','Property Id','lot','Block',
'3rd Largest Property Use Type','2nd Largest Property Use Type',
'Primary Property Type - Self Selected','Year Built')
raw_data1<-raw_data1[!colnames(raw_data1)%in% selection]
dmy <- dummyVars(" ~ .", data = raw_data1)
train_data <- data.frame(predict(dmy, newdata = raw_data1))
In this section I use recursive feature elimination (RFE) with random forest to identify the features that are likely to be important for modelling.
control <- rfeControl(functions=rfFuncs, method="cv", number=10,verbose =T)
# run the RFE algorithm
results <- readRDS("results.RDS")
results
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold)
##
## Resampling performance over subset size:
##
## Variables RMSE Rsquared MAE RMSESD RsquaredSD MAESD Selected
## 2 18.25 0.6833 14.824 1.5612 0.07318 1.3243
## 4 19.97 0.7021 16.568 0.9259 0.01575 0.8294
## 8 14.05 0.7941 10.729 0.5411 0.01671 0.5487
## 10 13.20 0.8075 9.541 0.5367 0.01587 0.2136
## 15 12.80 0.8167 8.989 0.3975 0.01153 0.1742
## 31 12.60 0.8226 8.854 0.4194 0.01195 0.1924 *
##
## The top 5 variables (out of 31):
## X.Source.EUI..kBtu.ft².., X.Largest.Property.Use.Type.Office, X.Largest.Property.Use.Type.Non.Refrigerated.Warehouse, Direct_GHG_Emissions_rate, X.Largest.Property.Use.Type.Multifamily.Housing
It seems most of the variables we considered are important for modelling (this is recursive feature elimination using random forest). An R-squared of 0.82 is a good starting point, and with an MAE of 8.854 it looks like random forest could be a good method to use.
It is also visible that using only 15 of these 31 features we would get an MAE of ~8.99, barely worse than the full set. We might want to go with only 15 features, since more features bring a greater risk of overfitting.
–> Let’s look at the order of importance assigned by RFE:
results$optVariables
## [1] "X.Source.EUI..kBtu.ft².."
## [2] "X.Largest.Property.Use.Type.Office"
## [3] "X.Largest.Property.Use.Type.Non.Refrigerated.Warehouse"
## [4] "Direct_GHG_Emissions_rate"
## [5] "X.Largest.Property.Use.Type.Multifamily.Housing"
## [6] "X.Site.EUI..kBtu.ft².."
## [7] "X.Largest.Property.Use.Type.Supermarket.Grocery.Store"
## [8] "X.Total.GHG.Emissions..Metric.Tons.CO2e.."
## [9] "X.Largest.Property.Use.Type.Hospital..General.Medical...Surgical."
## [10] "X.Largest.Property.Use.Type.Distribution.Center"
## [11] "X.Weather.Normalized.Site.Electricity.Intensity..kWh.ft².."
## [12] "second_property_use_rate"
## [13] "X.Weather.Normalized.Site.Natural.Gas.Intensity..therms.ft².."
## [14] "largest_property_use_rate"
## [15] "X.Water.Intensity..All.Water.Sources...gal.ft².."
## [16] "BoroughManhattan"
## [17] "X.Largest.Property.Use.Type.Hotel"
## [18] "X.Largest.Property.Use.Type.Retail.Store"
## [19] "BoroughQueens"
## [20] "X.Largest.Property.Use.Type.Residence.Hall.Dormitory"
## [21] "BoroughBronx"
## [22] "X.Largest.Property.Use.Type.K.12.School"
## [23] "X.Largest.Property.Use.Type.Senior.Care.Community"
## [24] "third_property_use_rate"
## [25] "BoroughBrooklyn"
## [26] "X.Largest.Property.Use.Type.Other"
## [27] "X.Largest.Property.Use.Type.Medical.Office"
## [28] "X.Number.of.Buildings...Self.reported."
## [29] "Occupancy"
## [30] "BoroughStaten.Island"
## [31] "X.Largest.Property.Use.Type.Parking"
–> It’s good to see that three of our engineered features are among the top selected: “Direct_GHG_Emissions_rate”, “second_property_use_rate” and “largest_property_use_rate”.
This section is about understanding which model can be a better choice for this problem. A stacked model is generally better than any individual model, because stacking takes advantage of the different modelling behaviours of its base models.
library(caret)
library(devtools)
library(caretEnsemble)
library(mlbench)
library(rsample)
library(dplyr)
#Split train/test
set.seed(123)
library(caTools)
split = sample.split(train_data$X.ENERGY.STAR.Score., SplitRatio = 0.8)
tr = subset(train_data, split == TRUE)
test = subset(train_data, split == FALSE)
folds=5
repeats=1
myControl <- trainControl(method='repeatedcv', number=folds, repeats=repeats, returnResamp='none',
returnData=FALSE, savePredictions=TRUE,
verboseIter=TRUE, allowParallel=TRUE,
index=createMultiFolds(train_data$X.ENERGY.STAR.Score., k=folds, times=repeats))
PP <- c('center', 'scale')
#cl <- makePSOCKcluster(10)
#registerDoParallel(cl)
#Train some models
##GBM
model1 <- readRDS("model1.RDS")
#KNN
model3 <- readRDS("model3.RDS")
model4 <- readRDS("model4.RDS")
model5 <- readRDS("model5.RDS")
model6 <- readRDS("model6.RDS")
model7 <- readRDS("model7.RDS")
#Make a list of all the models
all.models <- list(model1, model3, model4, model5,model6,model7)
names(all.models) <- sapply(all.models, function(x) x$method)
sort(sapply(all.models, function(x) max(x$results$Rsquared)))
## glmboost glmnet knn cubist gbm
## 0.04436680 0.04440788 0.32136178 0.82354712 0.82696916
#Make a linear/gbm/rf ensemble
results <- readRDS("results.RDS")
gbm <- readRDS("gbm.RDS")
rf <- readRDS("rf.RDS")
max(gbm$error$Rsquared)
## [1] 0.8295693
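For reference, the cached base models and the stacked gbm loaded above were presumably created along these lines (a sketch using caretEnsemble, not rerun here; `tr` and `myControl` are as defined earlier in this notebook):

```r
# Sketch: train base learners on shared CV folds, then stack them with gbm.
# Not rerun here; the fitted objects are cached in the RDS files loaded above.
model_list <- caretList(
  X.ENERGY.STAR.Score. ~ ., data = tr,
  trControl = myControl,
  methodList = c("gbm", "glmnet", "glmboost", "knn", "cubist")
)
gbm_stack <- caretStack(model_list, method = "gbm")
saveRDS(gbm_stack, "gbm.RDS")
```

Sharing `myControl`’s resampling indices across the base learners is what lets `caretStack` combine their out-of-fold predictions fairly.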
#Predict for test set:
#preds <- data.frame(sapply(all.models, predict, newdata=test[,-c(1,23)]))
#preds$ENS_gbm <- predict(gbm, newdata=test[,-c(1,23)])
#preds$ENS_rf <- predict(rf, newdata=test[,-c(1,23)])
#preds$ENS_glmnet <- predict(ak,newdata=test[,-c(1,23)])
The R-squared of our stacked gbm model is slightly higher than that of the best individual model, which means the stack is working a little better, but we should also consider other types of models to improve it further.
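When comparing the stack against the individual models on the held-out `test` set, the usual metrics can be computed directly. A base-R sketch on toy vectors (the observed/predicted values are illustrative):

```r
# RMSE, MAE and R-squared from observed and predicted values
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
mae  <- function(obs, pred) mean(abs(obs - pred))
rsq  <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

obs  <- c(50, 60, 70, 80)
pred <- c(52, 58, 71, 79)
c(RMSE = rmse(obs, pred), MAE = mae(obs, pred), R2 = rsq(obs, pred))
# MAE = 1.5, R2 = 0.98
```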
There is a lot more that could be done; to save time I have not covered it above, so I list it here as future work:
Model Explainability for black box models
Trying out other ways of missing value imputation and its effects on accuracy metrics
Feature distributions and outlier detection, both univariate and multivariate (e.g. DBSCAN)
Feature engineering with respect to other columns
Stacking with more models